Final Project: A protein expression analysis of Breast Cancer

Group 23

Introduction

The data set consists of:

  • iTRAQ proteome profiling of 77 breast cancer samples + 3 healthy samples, with expression values for ~12.000 proteins for each sample.

  • A file containing the clinical data of the 77 breast cancer patients (TCGA ID, sex, age, tumor receptors, etc.).

  • A file containing the list of genes and proteins used by the PAM50 classification system.

The analysis of this data set is relevant for multiple potential applications: expression analysis to identify biomarkers, understand disease heterogeneity, and infer personalized treatment strategies in breast cancer.

Materials and methods

Workflow

Materials and methods

Description of relevant variables

  • Dropped variables (redundant or not relevant): Survival.Data.Form, Days.to.date.of.Death, Days.to.Date.of.Last.Contact, OS.Time, Vital.Status, Tumor..T1.Coded, Metastasis.Coded, AJCC.Stage, Converted.Stage and all the columns destined to cluster the data

  • Created variables:

    • Age.Ini.Diagnostic.group: intervals of 10 years starting from 30 and going up until 90.

    • Age.Menopausal.group: [30, 45) Pre-menopausal, [45, 55) Menopausal, [55, 90) Post-menopausal.

    • ER_PR_HER2: level from 0 to 7 depending on hormonal receptors (ER, PR) present an the level of HER2.

    • TNBC: 0 if positive, 1 if negative

    • AJCC.Simp: simplified AJCC stages (I, II, III, and IV)

Materials and methods

Description of relevant variables

Analysis 1: PCA

Analysis 2: Kmeans

# Perform k-means clustering
kmeans_PAM50_5 <- PAM50_data |>
  kmeans(centers = 5, 
         iter.max = 10)

# Define the hulls for getting the points that define the shapes of each cluster
hulls <- PAM50_cluster_data |>
  group_by(Cluster_5) |>
  group_modify(~ .x[chull(.x$PC1, 
                          .x$PC2), 
                    ])

# Plot the final comparison
PAM50_cluster_data |>
  ggplot(mapping = aes(PC1,
                       PC2,
                       fill = factor(Cluster_5))) +
    scale_fill_manual(values = Cluster_colors) +
    scale_color_manual(values = PAM50_colors) +
    geom_point(mapping = aes(color = PAM50.mRNA),
               size = 1) +
    geom_polygon(data = hulls,
                 alpha = .1) +
    guides(fill = "none",
           color = guide_legend(title = "PAM50 classes")) +
    labs(title = "PAM50 classification compared to the predicted clusters\n(K-means clusters represented as a grey contour)",
         x = "Fitted PC1 (39.54 %)", 
         y = "Fitted PC2 (14.55 %)") +
    theme_light() +
    theme(plot.title = element_text(hjust = 0.5))

Analysis 3: Differential expression

Two functions were created allowing to easily perform several comparisons with the present data.

DEA_proteins()

DEA_proteins <- function(data_in, condition_test){
  col_name <- deparse(substitute(condition_test))
  data_long <- data_in |>
    dplyr::select(matches("^NP"),
                  matches("^XP"),
                  matches("^YP"),
                  {{ condition_test }}) |>   
    pivot_longer(cols = -{{ condition_test }},
                 names_to = "Protein",
                 values_to = "log2_iTRAQ")
  
  data_long_nested <- data_long |>
    group_by(Protein) |>
    nest() |>
    ungroup()
    
  data_w_model <- data_long_nested |>
    group_by(Protein) |>
    mutate(model_object = map(.x = data,
                              .f = ~lm(formula = str_c("log2_iTRAQ ~", col_name) ,
                                       data = .x)))

  data_w_model <- data_w_model |>
    mutate(model_object_tidy = map(.x = model_object,
                                   .f = ~tidy(.x,
                                              conf.int = TRUE,
                                              conf.level = 0.95)))
  
  estimates <- data_w_model |>
    unnest(model_object_tidy) |>
    filter(term == col_name) |>
    ungroup() |>
    dplyr::select(Protein, p.value, estimate, conf.low, conf.high) |>
    mutate(q.value = p.adjust(p.value)) |>
    mutate(dif_exp = case_when(q.value <= 0.05 & estimate > 0 ~ "Up",
                               q.value <= 0.05 & estimate < 0 ~ "Down",
                               q.value > 0.05 ~ 'NS'))

  plt_volcano <- volcano_plot(estimates, col_name)
  return(list(estimates=estimates, plt_volcano=plt_volcano))
}

volcano_plot()

volcano_plot <- function(data, condition_test){
  plt <- data |>
    group_by(dif_exp) |>
    mutate(label = case_when(dif_exp == "Up" ~  str_c(dif_exp,
                                                       " (Count: ",
                                                       n(),
                                                       ")" ),
                             dif_exp == "Down" ~  str_c(dif_exp,
                                                         " (Count: ",
                                                         n(),
                                                         ")" ),
                             dif_exp == "NS" ~  str_c(dif_exp))) |>
    ggplot(aes(x = estimate,
               y = -log10(p.value),
               colour = label)) +
    geom_point(alpha = 0.4,
               shape = "circle") +
    labs(title = str_c("Differentially expressed proteins in the test: ",
                        condition_test,
                        " vs. Non-",
                        condition_test),
         subtitle = "Proteins highlighted in either red or blue were 
         \nsignificant after multiple test correction",
         x = "Estimates", 
         y = expression(-log[10]~(p)),
         color = "Differential expression") +
    scale_color_manual(values = c("blue",
                                  "grey",
                                  "red")) +
    theme_minimal() +
    theme(legend.position = "right",
          plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5)) 
  return(plt)
}

Analysis 3: Differential expression

NP_002094.2: glycogen [starch] synthase, muscle isoform 1

Conclusions

  • Triple Negative Breast Cancer (TNBC) individuals show a different protein expression profile compared to Non-Triple Negative Breast Cancer.

  • The protein expression profiles from breast cancer affected individuals can be differentially clustered into the PAM50 gene classification system of breast cancer subtypes.

  • There are at least 84 proteins down-expressed and 115 up-expressed in TNBC individuals when comparing them to non-TNBC.

  • There seems to be no differential expression between breast cancer tumors allocated or not allocated in lymph nodes.

    • Only 1 protein (NP_002094.2: glycogen [starch] synthase, muscle isoform 1) seems to be significantlly down-expressed in individuals with breast cancer tumors allocated in lymph nodes.

Discussion and further analysis

  • A deeper study on the down- and up-expressed proteins from the TNBC VS Non-TNBC differential expression analysis could be interesting in order to define the functions and genes difference between the two type of breast cancer samples.

    • Gene-set enrichment analysis could be carry out for this purpose.
  • K-means can be developed as an unsupervised-method for medical diagnosis of Basal-like breast cancer tumors.

    • This could be relevant for detecting more dangerous subtypes of breast cancer and develop a medical treatment more precise focused on treating this subtype of breast cancer.
  • This project could be run again with a more accurate proteome dataset that includes, if possible, all the PAM50 genes and as few NA’s as possible.